Add support for fp16 iGEMM with SME2 #9005

JonathanC-ARM · 2025-10-20T08:30:18Z

Initial prototype of FP16 iGEMM support for SME2,
continuing the work from #8687.

- Initial prototype to enable fp16 iGEMM with SME2 in conv2d

Signed-off-by: Gian Marco Iodice <[email protected]>

Signed-off-by: Jonathan Clohessy <[email protected]>

dsharlet · 2025-10-21T04:01:46Z

This isn't building for us:

test/gemm-microkernel-tester.cc:2455:40: error: no viable overloaded '+='
 2455 |         c_ref[m_index * n() + n_index] +=
      |         ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ^
 2456 |             xnn_float16_to_float(input_f16[m_index * k() + k_index]) *
      |             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 2457 |             xnn_float16_to_float(weights[n_index * k() + k_index]);
      |             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
test/gemm-microkernel-tester.cc:2459:38: error: no viable overloaded '+='
 2459 |       c_ref[m_index * n() + n_index] += xnn_float16_to_float(bias[n_index]);
      |       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ^  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2 errors generated.
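
One plausible reading of this error, offered as a sketch and not as a diagnosis: `+=` is unavailable when the element type of `c_ref` is a struct-like half-precision type and not a native arithmetic type. The example below uses a hypothetical `Half` wrapper (not XNNPACK's real half type) and accumulates the reference result in `float`, converting once at the end. `Half`, `to_float`, `from_float`, and `reference_gemm` are illustrative names only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical storage-only half type. It deliberately has no arithmetic
// operators, so `Half x; x += 1.0f;` would not compile. A real type would
// hold 16 bits; a float is used here only to keep the sketch short.
struct Half {
  float storage;
};

inline float to_float(Half h) { return h.storage; }
inline Half from_float(float f) { return Half{f}; }

// Reference GEMM with bias: c[m][n] = sum_k a[m][k] * w[n][k] + bias[n].
// All accumulation happens in a float buffer, never in Half.
std::vector<Half> reference_gemm(const std::vector<Half>& a,
                                 const std::vector<Half>& w,
                                 const std::vector<Half>& bias,
                                 std::size_t m, std::size_t n, std::size_t k) {
  assert(a.size() == m * k && w.size() == n * k && bias.size() == n);
  std::vector<float> acc(m * n, 0.0f);
  for (std::size_t mi = 0; mi < m; ++mi) {
    for (std::size_t ni = 0; ni < n; ++ni) {
      for (std::size_t ki = 0; ki < k; ++ki) {
        acc[mi * n + ni] += to_float(a[mi * k + ki]) * to_float(w[ni * k + ki]);
      }
      acc[mi * n + ni] += to_float(bias[ni]);
    }
  }
  // Convert to the storage type only once, after accumulation is complete.
  std::vector<Half> c(m * n);
  for (std::size_t i = 0; i < acc.size(); ++i) {
    c[i] = from_float(acc[i]);
  }
  return c;
}
```

Keeping the accumulator in `float` avoids relying on compound assignment for the storage type, whatever that type is on a given platform.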

JonathanC-ARM · 2025-10-21T11:14:24Z

Hi @dsharlet, could you give a bit more context on the error, particularly the build command used? The strange thing is that I can't reproduce this on my end.

I am compiling on an M4, though; I'm going to try cross-compiling on an x86_64 machine shortly.

bazel build -c opt --enable_bzlmod --define xnn_enable_arm_sme=true --define xnn_enable_arm_sme2=true //test:gemm_microkernel_tester

I tried a few variations of the command, cleaned my environment, etc. I also synced my fork with master in case anything had changed since.

-- c69ccdb by Gian Marco Iodice <[email protected]>: Prototype: Add support for fp16 iGEMM with SME2
   - Initial prototype to enable fp16 iGEMM with SME2 in conv2d
   Signed-off-by: Gian Marco Iodice <[email protected]>
-- a3537a1 by Gian Marco Iodice <[email protected]>: Include missing files
   Signed-off-by: Gian Marco Iodice <[email protected]>
-- 232826c by Gian Marco Iodice <[email protected]>: Update FP16 iGEMM based on review comments
   Signed-off-by: Gian Marco Iodice <[email protected]>
-- 03bccaa by Jonathan Clohessy <[email protected]>: Updated FP16 iGemm Review with Fixes
   Signed-off-by: Jonathan Clohessy <[email protected]>
-- 9cd6e88 by Jonathan Clohessy <[email protected]>: Fix rebase issues
   Signed-off-by: Jonathan Clohessy <[email protected]>
FUTURE_COPYBARA_INTEGRATE_REVIEW=#9005 from JonathanC-ARM:f16_igemm 69ccf09
PiperOrigin-RevId: 821598958

No significant performance loss:
bench/sum_bf16_fp32_4x32_avx512bf16/real_time [256x1x256x1]   1.720µ ± 0%   1.719µ ± 17%   ~      (p=0.485 n=6)
bench/sum_fp16_fp32_4x32_avx512fp16/real_time [256x1x256x1]   1.744µ ± 3%   1.753µ ± 14%   ~      (p=0.310 n=6)
bench/sum_uint8_int32_4x64_avx512bw/real_time [256x1x256x1]   1.218µ ± 1%   1.216µ ± 17%   ~      (p=0.818 n=6)
bench/sum_int8_int32_4x64_avx512bw/real_time [256x1x256x1]    1.217µ ± 0%   1.216µ ± 15%   ~      (p=0.699 n=6)
bench/sum_fp32_4x16_avx512f/real_time [256x1x256x1]           2.263µ ± 1%   2.268µ ± 0%    ~      (p=0.394 n=6)
bench/sum_fp32_4x8_avx2/real_time [256x1x256x1]               4.342µ ± 0%   4.357µ ± 0%    ~      (p=0.065 n=6)
bench/sum_uint8_int32_4x32_avx2/real_time [256x1x256x1]       2.221µ ± 0%   2.285µ ± 8%    ~      (p=0.065 n=6)
bench/sum_int8_int32_4x32_avx2/real_time [256x1x256x1]        2.219µ ± 1%   2.279µ ± 2%    +2.70% (p=0.002 n=6)
bench/sum_fp16_fp32_4x16_f16c/real_time [256x1x256x1]         2.344µ ± 0%   2.345µ ± 7%    ~      (p=0.485 n=6)
bench/sum_uint8_int32_4x16_sse41/real_time [256x1x256x1]      4.318µ ± 0%   4.328µ ± 0%    +0.22% (p=0.015 n=6)
bench/sum_int8_int32_4x16_sse41/real_time [256x1x256x1]       4.319µ ± 0%   4.325µ ± 1%    ~      (p=0.394 n=6)
bench/sum_fp32_4x4_sse2/real_time [256x1x256x1]               8.790µ ± 0%   8.795µ ± 0%    ~      (p=0.394 n=6)
bench/sum_uint8_int32_4x16_sse2/real_time [256x1x256x1]       3.966µ ± 0%   3.995µ ± 0%    +0.73% (p=0.002 n=6)
bench/sum_int8_int32_4x16_sse2/real_time [256x1x256x1]        5.382µ ± 1%   5.410µ ± 1%    +0.52% (p=0.041 n=6)
bench/sum_uint8_int32_4x16_ssse3/real_time [256x1x256x1]      3.977µ ± 0%   3.994µ ± 1%    +0.44% (p=0.004 n=6)
bench/sum_int8_int32_4x16_ssse3/real_time [256x1x256x1]       5.373µ ± 0%   5.412µ ± 2%    +0.72% (p=0.002 n=6)
PiperOrigin-RevId: 821549068

…ally long multiple, this significantly improved performance. Since contiguous case tiles became different from discontiguous changed the naming to not include tiles information.
bench/sum_fp32_4x16_avx512f/real_time [256x1x256x1]           2.259µ ± 1%
bench/sum_fp32_4x8_avx2/real_time [256x1x256x1]               4.339µ ± 0%
bench/sum_fp32_4x4_sse2/real_time [256x1x256x1]               8.787µ ± 1%
bench/sum_fp32/real_time [256x1x256x1]                        3.255µ ± 7%
bench/sum_fp32_avx512f/real_time [256x1x256x1]                1.441µ ± 17%
bench/sum_fp32_avx2/real_time [256x1x256x1]                   1.761µ ± 14%
bench/sum_fp32_sse2/real_time [256x1x256x1]                   3.435µ ± 13%
bench/sum_fp32/real_time [256x1x256x1]                        3.261µ ± 13%
bench/sum_bf16_fp32_4x32_avx512bf16/real_time [256x1x256x1]   1.722µ ± 1%
bench/sum_bf16_fp32_avx512bf16/real_time [256x1x256x1]        1.703µ ± 1%
bench/sum_fp16_fp32_4x32_avx512fp16/real_time [256x1x256x1]   1.749µ ± 0%
bench/sum_fp16_fp32_avx512fp16/real_time [256x1x256x1]        1.744µ ± 0%
bench/sum_fp16_fp32_4x16_f16c/real_time [256x1x256x1]         2.341µ ± 1%
bench/sum_fp16_fp32_f16c/real_time [256x1x256x1]              1.652µ ± 7%
PiperOrigin-RevId: 821556723

…graph PiperOrigin-RevId: 821566586

PiperOrigin-RevId: 821694771

This causes builds to fail, and it's harmless to leave it enabled. PiperOrigin-RevId: 821704594

PiperOrigin-RevId: 821708108

According to the ACLE documentation, this increments *both* the slice and the pointer by `vnum` vectors. This usage of it treated it as if it only incremented the pointer to read from/write to by 1 vector (but did not change the slice). This is interesting because this code worked on QEMU, but fails on real (Apple M4) hardware. I think this indicates there is a bug in the implementation of these instructions in QEMU. PiperOrigin-RevId: 821730217
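
The behavior described above can be modeled in plain C++ without SME hardware. This is a toy model, not the ACLE intrinsic: `TileModel`, `load_hor_vnum`, the lane and slice counts, and the wrap-around of the slice index are all assumptions made for illustration. It shows only the one point from the commit message, namely that `vnum` offsets both the slice index and the address.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kLanes = 4;   // hypothetical: 32-bit lanes per vector
constexpr std::size_t kSlices = 4;  // hypothetical: horizontal slices per tile

// Toy stand-in for one ZA tile: kSlices rows of kLanes floats.
struct TileModel {
  std::array<std::array<float, kLanes>, kSlices> rows{};
};

// Models a "horizontal load with vnum": the row written is slice + vnum
// (wrapped to the tile height, an assumption of this model) and the data
// is read from ptr + vnum vectors.
void load_hor_vnum(TileModel& za, std::uint32_t slice, const float* ptr,
                   std::int64_t vnum) {
  assert(vnum >= 0);  // the sketch only handles non-negative offsets
  const std::size_t row = static_cast<std::size_t>(slice + vnum) % kSlices;
  const float* src = ptr + vnum * static_cast<std::int64_t>(kLanes);
  for (std::size_t lane = 0; lane < kLanes; ++lane) {
    za.rows[row][lane] = src[lane];
  }
}
```

Under this model, a loop that passes the same `v` as both `slice` and `vnum` writes row `2 * v` and leaves other rows untouched, whereas passing a fixed base slice with `vnum = v` fills consecutive rows.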

PiperOrigin-RevId: 821808685

PiperOrigin-RevId: 821857188

PiperOrigin-RevId: 821867761

PiperOrigin-RevId: 821984759

Signed-off-by: Jonathan Clohessy <[email protected]>

JonathanC-ARM · 2025-10-21T18:06:12Z

@dsharlet thanks for telling me about the build problem; it seems to show up only on Linux machines. I was able to fix it in the latest commit.

I will resolve the conflicts with master shortly.

-- c69ccdb by Gian Marco Iodice <[email protected]>: Prototype: Add support for fp16 iGEMM with SME2
   - Initial prototype to enable fp16 iGEMM with SME2 in conv2d
   Signed-off-by: Gian Marco Iodice <[email protected]>
-- a3537a1 by Gian Marco Iodice <[email protected]>: Include missing files
   Signed-off-by: Gian Marco Iodice <[email protected]>
-- 232826c by Gian Marco Iodice <[email protected]>: Update FP16 iGEMM based on review comments
   Signed-off-by: Gian Marco Iodice <[email protected]>
-- 03bccaa by Jonathan Clohessy <[email protected]>: Updated FP16 iGemm Review with Fixes
   Signed-off-by: Jonathan Clohessy <[email protected]>
-- 9cd6e88 by Jonathan Clohessy <[email protected]>: Fix rebase issues
   Signed-off-by: Jonathan Clohessy <[email protected]>
-- 7eb618d by Misha Gutman <[email protected]>: Added multiple_of to handle all multiples in reductions simply.
   No significant performance loss:
   bench/sum_bf16_fp32_4x32_avx512bf16/real_time [256x1x256x1]   1.720µ ± 0%   1.719µ ± 17%   ~      (p=0.485 n=6)
   bench/sum_fp16_fp32_4x32_avx512fp16/real_time [256x1x256x1]   1.744µ ± 3%   1.753µ ± 14%   ~      (p=0.310 n=6)
   bench/sum_uint8_int32_4x64_avx512bw/real_time [256x1x256x1]   1.218µ ± 1%   1.216µ ± 17%   ~      (p=0.818 n=6)
   bench/sum_int8_int32_4x64_avx512bw/real_time [256x1x256x1]    1.217µ ± 0%   1.216µ ± 15%   ~      (p=0.699 n=6)
   bench/sum_fp32_4x16_avx512f/real_time [256x1x256x1]           2.263µ ± 1%   2.268µ ± 0%    ~      (p=0.394 n=6)
   bench/sum_fp32_4x8_avx2/real_time [256x1x256x1]               4.342µ ± 0%   4.357µ ± 0%    ~      (p=0.065 n=6)
   bench/sum_uint8_int32_4x32_avx2/real_time [256x1x256x1]       2.221µ ± 0%   2.285µ ± 8%    ~      (p=0.065 n=6)
   bench/sum_int8_int32_4x32_avx2/real_time [256x1x256x1]        2.219µ ± 1%   2.279µ ± 2%    +2.70% (p=0.002 n=6)
   bench/sum_fp16_fp32_4x16_f16c/real_time [256x1x256x1]         2.344µ ± 0%   2.345µ ± 7%    ~      (p=0.485 n=6)
   bench/sum_uint8_int32_4x16_sse41/real_time [256x1x256x1]      4.318µ ± 0%   4.328µ ± 0%    +0.22% (p=0.015 n=6)
   bench/sum_int8_int32_4x16_sse41/real_time [256x1x256x1]       4.319µ ± 0%   4.325µ ± 1%    ~      (p=0.394 n=6)
   bench/sum_fp32_4x4_sse2/real_time [256x1x256x1]               8.790µ ± 0%   8.795µ ± 0%    ~      (p=0.394 n=6)
   bench/sum_uint8_int32_4x16_sse2/real_time [256x1x256x1]       3.966µ ± 0%   3.995µ ± 0%    +0.73% (p=0.002 n=6)
   bench/sum_int8_int32_4x16_sse2/real_time [256x1x256x1]        5.382µ ± 1%   5.410µ ± 1%    +0.52% (p=0.041 n=6)
   bench/sum_uint8_int32_4x16_ssse3/real_time [256x1x256x1]      3.977µ ± 0%   3.994µ ± 1%    +0.44% (p=0.004 n=6)
   bench/sum_int8_int32_4x16_ssse3/real_time [256x1x256x1]       5.373µ ± 0%   5.412µ ± 2%    +0.72% (p=0.002 n=6)
   PiperOrigin-RevId: 821549068
-- e5cb8c0 by Misha Gutman <[email protected]>: Changed K1_1 strategy for f32 to go with single accumulator and maximally long multiple, this significantly improved performance. Since contiguous case tiles became different from discontiguous changed the naming to not include tiles information.
   bench/sum_fp32_4x16_avx512f/real_time [256x1x256x1]           2.259µ ± 1%
   bench/sum_fp32_4x8_avx2/real_time [256x1x256x1]               4.339µ ± 0%
   bench/sum_fp32_4x4_sse2/real_time [256x1x256x1]               8.787µ ± 1%
   bench/sum_fp32/real_time [256x1x256x1]                        3.255µ ± 7%
   bench/sum_fp32_avx512f/real_time [256x1x256x1]                1.441µ ± 17%
   bench/sum_fp32_avx2/real_time [256x1x256x1]                   1.761µ ± 14%
   bench/sum_fp32_sse2/real_time [256x1x256x1]                   3.435µ ± 13%
   bench/sum_fp32/real_time [256x1x256x1]                        3.261µ ± 13%
   bench/sum_bf16_fp32_4x32_avx512bf16/real_time [256x1x256x1]   1.722µ ± 1%
   bench/sum_bf16_fp32_avx512bf16/real_time [256x1x256x1]        1.703µ ± 1%
   bench/sum_fp16_fp32_4x32_avx512fp16/real_time [256x1x256x1]   1.749µ ± 0%
   bench/sum_fp16_fp32_avx512fp16/real_time [256x1x256x1]        1.744µ ± 0%
   bench/sum_fp16_fp32_4x16_f16c/real_time [256x1x256x1]         2.341µ ± 1%
   bench/sum_fp16_fp32_f16c/real_time [256x1x256x1]              1.652µ ± 7%
   PiperOrigin-RevId: 821556723
-- aeeca5d by Dillon Sharlet <[email protected]>: Remove threadpool library and just build threadpool.cc as part of subgraph
   PiperOrigin-RevId: 821566586
-- 7304027 by Dillon Sharlet <[email protected]>: Disable SME when msan is enabled
   PiperOrigin-RevId: 821694771
-- 89a72e3 by Dillon Sharlet <[email protected]>: Don't bother disabling KleidiAI if using YNNPACK
   This causes builds to fail, and it's harmless to leave it enabled.
   PiperOrigin-RevId: 821704594
-- 0c5edfc by Dillon Sharlet <[email protected]>: Disable SME on older Apple compilers
   PiperOrigin-RevId: 821708108
-- 9b29972 by Dillon Sharlet <[email protected]>: Fix usage of `sv{ld,st}1_hor_vnum_za32`
   According to the ACLE documentation, this increments *both* the slice and the pointer by `vnum` vectors. This usage of it treated it as if it only incremented the pointer to read from/write to by 1 vector (but did not change the slice). This is interesting because this code worked on QEMU, but fails on real (Apple M4) hardware. I think this indicates there is a bug in the implementation of these instructions in QEMU.
   PiperOrigin-RevId: 821730217
-- 0d3dc09 by Dillon Sharlet <[email protected]>: Fix correctness of dot benchmarks for transpose_a kernels
   PiperOrigin-RevId: 821808685
-- 4b73eb1 by Pedro Gonnet <[email protected]>: Update `pthreadpool` dependency.
   PiperOrigin-RevId: 821857188
-- 66d084b by Dillon Sharlet <[email protected]>: Fix flaky quantize tests
   PiperOrigin-RevId: 821867761
-- 6fc5696 by Quentin Khan <[email protected]>: Add missing `gemm_config` `.element_size` initializations.
   PiperOrigin-RevId: 821984759
-- 923b7f9 by Jonathan Clohessy <[email protected]>: Fix build issues and guard against sme2 specific path
   Signed-off-by: Jonathan Clohessy <[email protected]>
FUTURE_COPYBARA_INTEGRATE_REVIEW=#9005 from JonathanC-ARM:f16_igemm 56ee7cb
PiperOrigin-RevId: 821598958

gonnet · 2025-10-22T12:51:41Z

This is still failing to build for the CI workflows, e.g. https://github.com/google/XNNPACK/actions/runs/18713631848/job/53367695764.

-- 06a44d2 by Jonathan Clohessy <[email protected]>: Refactor Convolution to new structure and fix build failures
   Signed-off-by: Jonathan Clohessy <[email protected]>
-- 175903d by Jonathan Clohessy <[email protected]>: Remove unused gemm config structure init
   Signed-off-by: Jonathan Clohessy <[email protected]>
FUTURE_COPYBARA_INTEGRATE_REVIEW=#9005 from JonathanC-ARM:f16_igemm 9efa3d6
PiperOrigin-RevId: 821598958

Signed-off-by: Jonathan Clohessy <[email protected]>

…_igemm

JonathanC-ARM · 2025-10-29T14:41:59Z

Hi @dsharlet, I made some additional changes and ran all of //test/... with sme2 both on and off. Everything passed testing; I was able to replicate the original failures and work through them, so I think it should all be good now.

Thanks

Signed-off-by: Jonathan Clohessy <[email protected]>

dsharlet · 2025-10-30T01:47:35Z

There were some issues with build timeouts earlier. I re-ran the failed builds, and a real build issue remains:

C:\Users\runneradmin\.cargo\bin\ccache.exe C:\PROGRA~1\MICROS~2\2022\ENTERP~1\VC\Tools\MSVC\1444~1.352\bin\Hostx64\arm64\cl.exe  /nologo /TP -DNOMINMAX -DPTHREADPOOL_NO_DEPRECATED_API=1 -DXNN_ENABLE_ARM_BF16=0 -DXNN_ENABLE_ARM_DOTPROD=1 -DXNN_ENABLE_ARM_FP16_SCALAR=0 -DXNN_ENABLE_ARM_FP16_VECTOR=1 -DXNN_ENABLE_ARM_I8MM=1 -DXNN_ENABLE_ARM_SME2=0 -DXNN_ENABLE_ARM_SME=1 -DXNN_ENABLE_ASSEMBLY=0 -DXNN_ENABLE_AVX256SKX=1 -DXNN_ENABLE_AVX256VNNI=1 -DXNN_ENABLE_AVX256VNNIGFNI=1 -DXNN_ENABLE_AVX2=1 -DXNN_ENABLE_AVX512AMX=1 -DXNN_ENABLE_AVX512BF16=0 -DXNN_ENABLE_AVX512F=1 -DXNN_ENABLE_AVX512FP16=0 -DXNN_ENABLE_AVX512SKX=1 -DXNN_ENABLE_AVX512VBMI=1 -DXNN_ENABLE_AVX512VNNI=1 -DXNN_ENABLE_AVX512VNNIGFNI=1 -DXNN_ENABLE_AVX=1 -DXNN_ENABLE_AVXVNNI=1 -DXNN_ENABLE_AVXVNNIINT8=0 -DXNN_ENABLE_CPUINFO=1 -DXNN_ENABLE_F16C=1 -DXNN_ENABLE_FMA3=1 -DXNN_ENABLE_HVX=1 -DXNN_ENABLE_KLEIDIAI=0 -DXNN_ENABLE_RISCV_VECTOR=1 -DXNN_ENABLE_SPARSE=1 -DXNN_ENABLE_SSE2=1 -DXNN_ENABLE_SSE41=1 -DXNN_ENABLE_SSE=1 -DXNN_ENABLE_SSSE3=1 -DXNN_ENABLE_VSX=1 -DXNN_ENABLE_WASM_REVECTORIZE=0 -DXNN_LOG_LEVEL=0 -IC:\a\XNNPACK\XNNPACK\include -IC:\a\XNNPACK\XNNPACK\build\windows\arm64\pthreadpool-source\include -external:IC:\a\XNNPACK\XNNPACK\. -external:IC:\a\XNNPACK\XNNPACK\build\windows\arm64\googletest-source\googlemock\include -external:IC:\a\XNNPACK\XNNPACK\build\windows\arm64\googletest-source\googlemock -external:IC:\a\XNNPACK\XNNPACK\build\windows\arm64\googletest-source\googletest\include -external:IC:\a\XNNPACK\XNNPACK\build\windows\arm64\googletest-source\googletest -external:W0 /UNDEBUG  /DWIN32 /D_WINDOWS /GR /EHsc /O2 /Ob2 /DNDEBUG -std:c++14 -MD /wd4146 /bigobj /wd4190 /O2 /DEBUG:FASTLINK /Zi /showIncludes /Fotest\CMakeFiles\gemm-microkernel-tester.dir\gemm-microkernel-tester.cc.obj /Fdtest\CMakeFiles\gemm-microkernel-tester.dir\gemm-microkernel-tester.pdb /FS -c C:\a\XNNPACK\XNNPACK\test\gemm-microkernel-tester.cc
C:\a\XNNPACK\XNNPACK\test\gemm-microkernel-tester.cc(2864): error C3861: 'xnn_packed_size_kai_f16_conv_goki_w': identifier not found
C:\a\XNNPACK\XNNPACK\test\gemm-microkernel-tester.cc(2870): error C3861: 'xnn_pack_kai_f16_conv_goki_w_sme': identifier not found

JonathanC-ARM · 2025-10-30T11:41:39Z

I just made some small tweaks to the ifdef guards, which had been letting this code into non-KleidiAI builds.
bazel test --compilation_mode=opt --define xnn_enable_assembly=false --define xnn_enable_arm_fp16_scalar=false --define xnn_enable_arm_bf16=false --define xnn_enable_kleidiai=false //test/...
With this command I was able to see the failure and resolve it, and from what I was able to test on my end it should be working now.
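
As a minimal sketch of the kind of guard involved, assuming the build passes `XNN_ENABLE_KLEIDIAI` as `0` or `1` (as in the compiler command line quoted earlier): code that references KleidiAI-only symbols is compiled only when the macro is non-zero. `hypothetical_kai_packed_size` and `try_packed_size` are made-up names for illustration and do not correspond to the real packing functions or their signatures.

```cpp
#include <cassert>
#include <cstddef>

#ifndef XNN_ENABLE_KLEIDIAI
#define XNN_ENABLE_KLEIDIAI 0  // assumed default, for this sketch only
#endif

#if XNN_ENABLE_KLEIDIAI
// Declared only in KleidiAI builds. Hypothetical name and signature.
std::size_t hypothetical_kai_packed_size(std::size_t n, std::size_t k);
#endif

// Returns true and writes the packed size when the KleidiAI path is
// compiled in; otherwise writes 0 and returns false so callers can skip.
bool try_packed_size(std::size_t n, std::size_t k, std::size_t* out) {
  assert(out != nullptr);
#if XNN_ENABLE_KLEIDIAI
  *out = hypothetical_kai_packed_size(n, k);
  return true;
#else
  (void)n;
  (void)k;
  *out = 0;
  return false;
#endif
}
```

With the guard in place, a build configured with KleidiAI disabled never sees the undeclared identifiers, which is the class of error the MSVC log reports.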

-- 999f4e3 by Jonathan Clohessy <[email protected]>: Updated code with sme variants of kernels and fixed tests
   Signed-off-by: Jonathan Clohessy <[email protected]>
-- a2bd7aa by Jonathan Clohessy <[email protected]>: Updated ifdef guards and yml file
   Signed-off-by: Jonathan Clohessy <[email protected]>
FUTURE_COPYBARA_INTEGRATE_REVIEW=#9005 from JonathanC-ARM:f16_igemm a2bd7aa
PiperOrigin-RevId: 821598958

dsharlet · 2025-11-02T16:15:59Z

This is crashing in some of our tests. The symptom seems relatively simple: this function pointer call

XNNPACK/src/operator-run.c

Line 1004 in 48c1b36

context->ukernel.function[XNN_UARCH_DEFAULT](

Is trying to call this function: https://github.com/google/XNNPACK/pull/9005/files#diff-017fa5d842d8909aebb30be2f3f22e7c785374b4c7e1b26ac6ad1a5853efc794R40-R43

The arguments don't match: one of the mismatched arguments is `params`, which is being passed `context->cn_stride`, hence the crash.

I suspect the problem is that we are missing an LH packing config here?

XNNPACK/src/operators/convolution-nhwc.c

Lines 2151 to 2155 in 48c1b36

    
      case xnn_operator_type_convolution_nhwc_pqs8_qs8_qc8w:
        if (inline_lhs_packing) {
          packed_lh_config = xnn_init_x8_igemm_pack_lh_config();
        }
        break;

The presence of that seems like it will affect which operator-run code will execute.
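
The failure mode described above can be illustrated in isolation. In this sketch every name is hypothetical and none of it is XNNPACK's actual type layout: the kernel records which signature it expects, and the run path selects the matching call, so a value such as `cn_stride` is never passed where a `params` pointer is expected.

```cpp
#include <cassert>
#include <cstddef>

// Two kernel flavors with different signatures (both hypothetical).
enum class KernelKind { kPlain, kInlineLhsPacking };
using PlainFn = std::size_t (*)(std::size_t mr, std::size_t cn_stride);
using PackedFn = std::size_t (*)(std::size_t mr, const int* params);

struct Kernel {
  KernelKind kind;  // tag that says which union member is valid
  union {
    PlainFn plain;
    PackedFn packed;
  } fn;
};

struct Context {
  std::size_t mr;
  std::size_t cn_stride;
  const int* params;
};

Kernel make_plain(PlainFn f) {
  Kernel k;
  k.kind = KernelKind::kPlain;
  k.fn.plain = f;
  return k;
}

Kernel make_packed(PackedFn f) {
  Kernel k;
  k.kind = KernelKind::kInlineLhsPacking;
  k.fn.packed = f;
  return k;
}

// Dispatch on the tag so each function pointer is called with the
// argument list it was declared with.
std::size_t run(const Kernel& kernel, const Context& ctx) {
  switch (kernel.kind) {
    case KernelKind::kPlain:
      return kernel.fn.plain(ctx.mr, ctx.cn_stride);
    case KernelKind::kInlineLhsPacking:
      return kernel.fn.packed(ctx.mr, ctx.params);
  }
  return 0;
}

// Toy kernels used only to exercise the dispatch.
std::size_t plain_kernel(std::size_t mr, std::size_t cn_stride) {
  return mr * cn_stride;
}
std::size_t packed_kernel(std::size_t mr, const int* params) {
  return mr + static_cast<std::size_t>(*params);
}
```

If the tag and the stored pointer disagree, the call goes through the wrong signature, which is undefined behavior and in practice tends to crash in the way described.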

Aside from the problem itself, I'm concerned that our operator and subgraph tests didn't catch this. We need to make sure we have test coverage from the subgraph API for this before we merge it.

I'm investigating a bug from #9005, and discovered that many of these codepaths are rarely or never exercised by our tests, because there are just too many of them (and some of them are simply dead code). The probability of groups = 1 and batches = 1 in a randomized test is low, and when running on slow emulators, we don't get many chances. We also have different code for ARM vs. not-ARM (via `XNN_MAX_UARCH_TYPES`), which again forks our test coverage. I don't think these specializations are worth the cost (the constant vigilance required to ensure we don't lose test coverage of all of these paths). PiperOrigin-RevId: 826767227
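
To make the coverage concern concrete: if, purely as an assumption, `groups` and `batch` were each drawn uniformly from 1 to 4, a single random case would hit `groups = 1` and `batch = 1` with probability 1/16. One common mitigation, sketched below with hypothetical names (`ConvCase`, `make_cases`), is to list the special cases explicitly and append random cases after them.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical test-case descriptor; real tests carry many more fields.
struct ConvCase {
  std::size_t groups;
  std::size_t batch;
};

// Builds a case list that always starts with the corner cases that select
// specialized code paths, then appends `random_count` random cases.
std::vector<ConvCase> make_cases(std::size_t random_count, unsigned seed) {
  std::vector<ConvCase> cases = {{1, 1}, {1, 2}, {2, 1}};
  std::mt19937 rng(seed);
  std::uniform_int_distribution<std::size_t> dist(1, 4);
  for (std::size_t i = 0; i < random_count; ++i) {
    const std::size_t groups = dist(rng);
    const std::size_t batch = dist(rng);
    cases.push_back({groups, batch});
  }
  return cases;
}
```

This keeps the randomized exploration while guaranteeing that the rare combinations run on every invocation, including on slow emulators.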

dsharlet · 2025-11-02T16:20:34Z

I sent #9005 which attempts to reduce the number of codepaths that are relevant here. When I was trying to investigate the crash, I found that there are many different codepaths that could be used, and we only test a few of them. However, I don't believe that addresses the bug in this case.

Signed-off-by: Jonathan Clohessy <[email protected]>

JonathanC-ARM · 2025-11-17T15:13:41Z

Hi @dsharlet, I've made some changes and added a new test case that exercises the failing code path. The test currently passes, but commenting out the following in src/operators/convolution-nhwc.c (around line 2200), as shown:

      case xnn_operator_type_convolution_nhwc_pf16:
        if (inline_lhs_packing) {
          // packed_lh_config = xnn_init_x16_igemm_pack_lh_config();
        }
        break;

causes the test to hit a segmentation fault, for the reason you described previously. With this update, all the tests I am aware of should now pass.

dsharlet · 2025-11-17T17:38:21Z

test/subgraph/subgraph-tester.h


  xnn_subgraph* Subgraph() const { return subgraph_.get(); }

+  // Utility to help force inline LHS packing for the last convolution node (pf16)


AddConvolution2D has a flags parameter, can you just use that instead of adding this helper?

Also: adding an include of operator.h in this file is breaking the bazel build. Rather than fixing that, I'd rather just revert the changes in this file and use the flags parameter.

Hi @dsharlet, I dropped this change and went with setting the flags in the test. Also, apologies for the build break; the strange thing is that this built on M4 with Bazel without problems, so I was unaware of it. I tested the latest change on Ubuntu and M4, so it should be good.

-- c69ccdb by Gian Marco Iodice <[email protected]>:
Prototype: Add support for fp16 iGEMM with SME2
- Initial prototype to enable fp16 iGEMM with SME2 in conv2d
Signed-off-by: Gian Marco Iodice <[email protected]>

-- a3537a1 by Gian Marco Iodice <[email protected]>:
Include missing files
Signed-off-by: Gian Marco Iodice <[email protected]>

-- 232826c by Gian Marco Iodice <[email protected]>:
Update FP16 iGEMM based on review comments
Signed-off-by: Gian Marco Iodice <[email protected]>

-- 03bccaa by Jonathan Clohessy <[email protected]>:
Updated FP16 iGemm Review with Fixes
Signed-off-by: Jonathan Clohessy <[email protected]>

-- 9cd6e88 by Jonathan Clohessy <[email protected]>:
Fix rebase issues
Signed-off-by: Jonathan Clohessy <[email protected]>

-- 7eb618d by Misha Gutman <[email protected]>:
Added multiple_of to handle all multiples in reductions simply. No significant performance loss:
bench/sum_bf16_fp32_4x32_avx512bf16/real_time [256x1x256x1] 1.720µ ± 0% 1.719µ ± 17% ~ (p=0.485 n=6)
bench/sum_fp16_fp32_4x32_avx512fp16/real_time [256x1x256x1] 1.744µ ± 3% 1.753µ ± 14% ~ (p=0.310 n=6)
bench/sum_uint8_int32_4x64_avx512bw/real_time [256x1x256x1] 1.218µ ± 1% 1.216µ ± 17% ~ (p=0.818 n=6)
bench/sum_int8_int32_4x64_avx512bw/real_time [256x1x256x1] 1.217µ ± 0% 1.216µ ± 15% ~ (p=0.699 n=6)
bench/sum_fp32_4x16_avx512f/real_time [256x1x256x1] 2.263µ ± 1% 2.268µ ± 0% ~ (p=0.394 n=6)
bench/sum_fp32_4x8_avx2/real_time [256x1x256x1] 4.342µ ± 0% 4.357µ ± 0% ~ (p=0.065 n=6)
bench/sum_uint8_int32_4x32_avx2/real_time [256x1x256x1] 2.221µ ± 0% 2.285µ ± 8% ~ (p=0.065 n=6)
bench/sum_int8_int32_4x32_avx2/real_time [256x1x256x1] 2.219µ ± 1% 2.279µ ± 2% +2.70% (p=0.002 n=6)
bench/sum_fp16_fp32_4x16_f16c/real_time [256x1x256x1] 2.344µ ± 0% 2.345µ ± 7% ~ (p=0.485 n=6)
bench/sum_uint8_int32_4x16_sse41/real_time [256x1x256x1] 4.318µ ± 0% 4.328µ ± 0% +0.22% (p=0.015 n=6)
bench/sum_int8_int32_4x16_sse41/real_time [256x1x256x1] 4.319µ ± 0% 4.325µ ± 1% ~ (p=0.394 n=6)
bench/sum_fp32_4x4_sse2/real_time [256x1x256x1] 8.790µ ± 0% 8.795µ ± 0% ~ (p=0.394 n=6)
bench/sum_uint8_int32_4x16_sse2/real_time [256x1x256x1] 3.966µ ± 0% 3.995µ ± 0% +0.73% (p=0.002 n=6)
bench/sum_int8_int32_4x16_sse2/real_time [256x1x256x1] 5.382µ ± 1% 5.410µ ± 1% +0.52% (p=0.041 n=6)
bench/sum_uint8_int32_4x16_ssse3/real_time [256x1x256x1] 3.977µ ± 0% 3.994µ ± 1% +0.44% (p=0.004 n=6)
bench/sum_int8_int32_4x16_ssse3/real_time [256x1x256x1] 5.373µ ± 0% 5.412µ ± 2% +0.72% (p=0.002 n=6)
PiperOrigin-RevId: 821549068

-- e5cb8c0 by Misha Gutman <[email protected]>:
Changed K1_1 strategy for f32 to go with single accumulator and maximally long multiple, this significantly improved performance. Since contiguous case tiles became different from discontiguous changed the naming to not include tiles information.
bench/sum_fp32_4x16_avx512f/real_time [256x1x256x1] 2.259µ ± 1%
bench/sum_fp32_4x8_avx2/real_time [256x1x256x1] 4.339µ ± 0%
bench/sum_fp32_4x4_sse2/real_time [256x1x256x1] 8.787µ ± 1%
bench/sum_fp32/real_time [256x1x256x1] 3.255µ ± 7%
bench/sum_fp32_avx512f/real_time [256x1x256x1] 1.441µ ± 17%
bench/sum_fp32_avx2/real_time [256x1x256x1] 1.761µ ± 14%
bench/sum_fp32_sse2/real_time [256x1x256x1] 3.435µ ± 13%
bench/sum_fp32/real_time [256x1x256x1] 3.261µ ± 13%
bench/sum_bf16_fp32_4x32_avx512bf16/real_time [256x1x256x1] 1.722µ ± 1%
bench/sum_bf16_fp32_avx512bf16/real_time [256x1x256x1] 1.703µ ± 1%
bench/sum_fp16_fp32_4x32_avx512fp16/real_time [256x1x256x1] 1.749µ ± 0%
bench/sum_fp16_fp32_avx512fp16/real_time [256x1x256x1] 1.744µ ± 0%
bench/sum_fp16_fp32_4x16_f16c/real_time [256x1x256x1] 2.341µ ± 1%
bench/sum_fp16_fp32_f16c/real_time [256x1x256x1] 1.652µ ± 7%
PiperOrigin-RevId: 821556723

-- aeeca5d by Dillon Sharlet <[email protected]>:
Remove threadpool library and just build threadpool.cc as part of subgraph
PiperOrigin-RevId: 821566586

-- 7304027 by Dillon Sharlet <[email protected]>:
Disable SME when msan is enabled
PiperOrigin-RevId: 821694771

-- 89a72e3 by Dillon Sharlet <[email protected]>:
Don't bother disabling KleidiAI if using YNNPACK
This causes builds to fail, and it's harmless to leave it enabled.
PiperOrigin-RevId: 821704594

-- 0c5edfc by Dillon Sharlet <[email protected]>:
Disable SME on older Apple compilers
PiperOrigin-RevId: 821708108

-- 9b29972 by Dillon Sharlet <[email protected]>:
Fix usage of `sv{ld,st}1_hor_vnum_za32`
According to the ACLE documentation, this increments *both* the slice and the pointer by `vnum` vectors. This usage of it treated it as if it only incremented the pointer to read from/write to by 1 vector (but did not change the slice). This is interesting because this code worked on QEMU, but fails on real (Apple M4) hardware. I think this indicates there is a bug in the implementation of these instructions in QEMU.
PiperOrigin-RevId: 821730217

-- 0d3dc09 by Dillon Sharlet <[email protected]>:
Fix correctness of dot benchmarks for transpose_a kernels
PiperOrigin-RevId: 821808685

-- 4b73eb1 by Pedro Gonnet <[email protected]>:
Update `pthreadpool` dependency.
PiperOrigin-RevId: 821857188

-- 66d084b by Dillon Sharlet <[email protected]>:
Fix flaky quantize tests
PiperOrigin-RevId: 821867761

-- 6fc5696 by Quentin Khan <[email protected]>:
Add missing `gemm_config` `.element_size` initializations.
PiperOrigin-RevId: 821984759

-- 923b7f9 by Jonathan Clohessy <[email protected]>:
Fix build issues and guard against sme2 specific path
Signed-off-by: Jonathan Clohessy <[email protected]>

-- 06a44d2 by Jonathan Clohessy <[email protected]>:
Refactor Convolution to new structure and fix build failures
Signed-off-by: Jonathan Clohessy <[email protected]>

-- 175903d by Jonathan Clohessy <[email protected]>:
Remove unused gemm config structure init
Signed-off-by: Jonathan Clohessy <[email protected]>

-- 999f4e3 by Jonathan Clohessy <[email protected]>:
Updated code with sme variants of kernels and fixed tests
Signed-off-by: Jonathan Clohessy <[email protected]>

-- a2bd7aa by Jonathan Clohessy <[email protected]>:
Updated ifdef guards and yml file
Signed-off-by: Jonathan Clohessy <[email protected]>

-- 551cfde by Jonathan Clohessy <[email protected]>:
Add new test case and fix issue with LHS pack
Signed-off-by: Jonathan Clohessy <[email protected]>

-- bcc62a0 by Jonathan Clohessy <[email protected]>:
Removed ForceInlineLhsPackingPf16OnLastConv and use runtime flags instead
Signed-off-by: Jonathan Clohessy <[email protected]>

FUTURE_COPYBARA_INTEGRATE_REVIEW=#9005 from JonathanC-ARM:f16_igemm bcc62a0
PiperOrigin-RevId: 833326167
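For readers unfamiliar with the `vnum` semantics mentioned in the 9b29972 message above, here is a rough toy model in plain C++ (no SME intrinsics, so it runs anywhere). It only illustrates the statement that `vnum` offsets both the ZA slice and the memory pointer. Everything in it is an assumption made for illustration: `ToyTile`, `toy_ld1_hor_vnum`, the fixed sizes `kLanes` and `kSlices`, and the wrap-around of the slice index are hypothetical, and the "double counted" loop is an invented misuse, not the code that the commit actually fixed.

```cpp
#include <cstddef>
#include <vector>

// Toy sizes only; on real hardware these depend on the streaming vector length.
constexpr std::size_t kLanes = 4;   // 32-bit words per vector (hypothetical)
constexpr std::size_t kSlices = 4;  // horizontal slices per tile (hypothetical)

// Hypothetical stand-in for a 32-bit ZA tile: kSlices rows of kLanes words.
struct ToyTile {
  float rows[kSlices][kLanes] = {};
};

// Models the behaviour described in the commit message: `vnum` is added to
// BOTH the destination slice index and the source pointer (in whole vectors).
void toy_ld1_hor_vnum(ToyTile& tile, std::size_t slice, const float* ptr,
                      std::size_t vnum) {
  const float* src = ptr + vnum * kLanes;            // pointer advances by vnum vectors
  float* dst = tile.rows[(slice + vnum) % kSlices];  // slice advances by vnum too
  for (std::size_t i = 0; i < kLanes; ++i) dst[i] = src[i];
}

// Fills every slice from contiguous memory: keep the base slice fixed and let
// vnum select both the row and the source vector.
ToyTile fill_tile_correct(const std::vector<float>& data) {
  ToyTile tile;
  for (std::size_t i = 0; i < kSlices; ++i) {
    toy_ld1_hor_vnum(tile, /*slice=*/0, data.data(), /*vnum=*/i);
  }
  return tile;
}

// Hypothetical misuse: the caller assumes vnum moves only the pointer and also
// advances the slice itself, so the slice offset is applied twice.
ToyTile fill_tile_double_counted(const std::vector<float>& data) {
  ToyTile tile;
  for (std::size_t i = 0; i < kSlices; ++i) {
    toy_ld1_hor_vnum(tile, /*slice=*/i, data.data(), /*vnum=*/i);
  }
  return tile;
}
```

In this toy model, the second loop writes only the even rows and leaves the odd rows untouched, which is the kind of silent mismatch that an emulator with different slice handling could mask.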

gmiodice and others added 4 commits October 20, 2025 09:24

Prototype: Add support for fp16 iGEMM with SME2

c69ccdb

- Initial prototype to enable fp16 iGEMM with SME2 in conv2d Signed-off-by: Gian Marco Iodice <[email protected]>

Include missing files

a3537a1

Signed-off-by: Gian Marco Iodice <[email protected]>

Update FP16 iGEMM based on review comments

232826c

Signed-off-by: Gian Marco Iodice <[email protected]>

Updated FP16 iGemm Review with Fixes

03bccaa

Signed-off-by: Jonathan Clohessy <[email protected]>

JonathanC-ARM mentioned this pull request Oct 20, 2025

Prototype: Add support for fp16 iGEMM with SME2 #8687

Open

Fix rebase issues

9cd6e88

Signed-off-by: Jonathan Clohessy <[email protected]>

JonathanC-ARM force-pushed the f16_igemm branch from 77362fd to 9cd6e88 Compare October 20, 2025 09:44

copybara-service bot mentioned this pull request Oct 21, 2025

Copybara import of the project: #9017

Closed

Aelphy and others added 12 commits October 21, 2025 16:16

Remove threadpool library and just build threadpool.cc as part of subgraph

aeeca5d

PiperOrigin-RevId: 821566586

Disable SME when msan is enabled

7304027

PiperOrigin-RevId: 821694771

Don't bother disabling KleidiAI if using YNNPACK

89a72e3

This causes builds to fail, and it's harmless to leave it enabled. PiperOrigin-RevId: 821704594

Disable SME on older Apple compilers

0c5edfc

PiperOrigin-RevId: 821708108

Fix correctness of dot benchmarks for transpose_a kernels

0d3dc09

PiperOrigin-RevId: 821808685

Update pthreadpool dependency.

4b73eb1

PiperOrigin-RevId: 821857188

Fix flaky quantize tests

66d084b

PiperOrigin-RevId: 821867761

Add missing gemm_config .element_size initializations.

6fc5696

PiperOrigin-RevId: 821984759

Fix build issues and guard against sme2 specific path

923b7f9

Signed-off-by: Jonathan Clohessy <[email protected]>

JonathanC-ARM force-pushed the f16_igemm branch from 6eb7ad7 to 56ee7cb Compare October 22, 2025 10:32

JonathanC-ARM force-pushed the f16_igemm branch from 56ee7cb to bf9d731 Compare October 22, 2025 10:45

Merge remote-tracking branch 'origin/master' into f16_igemm

22beb50

JonathanC-ARM force-pushed the f16_igemm branch from bf9d731 to 22beb50 Compare October 22, 2025 10:46

Merge branch 'google:master' into f16_igemm

9efa3d6

JonathanC-ARM added 2 commits October 29, 2025 14:06

Updated code with sme variants of kernels and fixed tests

999f4e3

Signed-off-by: Jonathan Clohessy <[email protected]>

Merge branch 'f16_igemm' of github.com:JonathanC-ARM/XNNPACK into f16_igemm

892eee1

dsharlet approved these changes Oct 29, 2025

Updated ifdef guards and yml file

a2bd7aa

Signed-off-by: Jonathan Clohessy <[email protected]>

dsharlet approved these changes Oct 31, 2025

JonathanC-ARM closed this Nov 2, 2025

JonathanC-ARM reopened this Nov 2, 2025

copybara-service bot mentioned this pull request Nov 2, 2025

Clean up unnecessary codepaths from operator-run #9075

Open

Add new test case and fix issue with LHS pack

551cfde

Signed-off-by: Jonathan Clohessy <[email protected]>

Merge branch 'google:master' into f16_igemm

f62aea6

dsharlet reviewed Nov 17, 2025

copybara-service bot mentioned this pull request Nov 17, 2025

Copybara import of the project: #9163

Closed

JonathanC-ARM added 2 commits November 17, 2025 20:01

Merge branch 'f16_igemm' of github.com:JonathanC-ARM/XNNPACK into f16_igemm

63377fe

Removed ForceInlineLhsPackingPf16OnLastConv and use runtime flags instead

bcc62a0

Signed-off-by: Jonathan Clohessy <[email protected]>

dsharlet approved these changes Nov 18, 2025

copybara-service bot merged commit 6dbb696 into google:master Nov 18, 2025
23 checks passed


		xnn_subgraph* Subgraph() const { return subgraph_.get(); }

		// Utility to help force inline LHS packing for the last convolution node (pf16)

7 participants
